    FPGA-SPICE: A Simulation-Based Architecture Evaluation Framework for FPGAs

    In this paper, we developed a simulation-based architecture evaluation framework for field-programmable gate arrays (FPGAs), called FPGA-SPICE, which enables automatic layout-level estimation and electrical simulation of FPGA architectures. FPGA-SPICE can automatically generate Verilog and SPICE netlists based on realistic FPGA configurations and a high-level eXtensible Markup Language (XML)-based FPGA architectural description language. The output Verilog netlists can be used to generate layouts of full FPGA fabrics through a semicustom design flow. SPICE simulation decks can be generated at three levels of complexity, namely full-chip level, grid level, and component level, providing different tradeoffs between accuracy and simulation time. To enable these levels of analysis, we presented two SPICE netlist partitioning techniques: loads extraction and parasitic net activity estimation. Electrical simulations showed that, averaged over the selected benchmarks, the grid-/component-level approaches achieve 6.1x/7.5x execution speed-up with 9.9%/8.3% accuracy loss, respectively, compared to full-chip-level simulation. FPGA-SPICE was showcased through three case studies: 1) an area breakdown analysis for static random access memory-based FPGAs, showing that configuration memories are a dominant factor; 2) a power breakdown comparison to analytical models, analyzing the sources of accuracy loss; and 3) a robustness evaluation against process corners, studying their impact on the energy consumption of full FPGA fabrics.
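    The two partitioning techniques are only named above. As a rough illustration of the loads-extraction idea, the sketch below replaces the inactive fan-out of a net with a single lumped capacitance so that only the active path is simulated in full; the net/sink representation and all names are hypothetical, not FPGA-SPICE code.

```python
# Illustrative sketch (not FPGA-SPICE's actual code) of a "loads
# extraction" pass: when simulating one grid or component, fan-out
# circuits that are off the active path are replaced by a lumped
# equivalent capacitance so the SPICE deck for the partition stays
# small. All names here are hypothetical.

from dataclasses import dataclass, field

@dataclass
class Sink:
    name: str
    input_cap_f: float   # input capacitance of the downstream circuit, in farads
    active: bool         # is this sink on the path being simulated?

@dataclass
class Net:
    name: str
    sinks: list = field(default_factory=list)

def extract_loads(net: Net):
    """Split a net's fan-out into sinks kept as full subcircuits and
    one lumped capacitor standing in for the inactive sinks."""
    kept = [s for s in net.sinks if s.active]
    lumped_cap = sum(s.input_cap_f for s in net.sinks if not s.active)
    return kept, lumped_cap

# Example: a routing net driving one active LUT input and three
# inactive switch-block muxes; only the LUT is simulated in full.
net = Net("routing_net_42", [
    Sink("lut_in",  1.5e-15, active=True),
    Sink("sb_mux0", 0.8e-15, active=False),
    Sink("sb_mux1", 0.8e-15, active=False),
    Sink("sb_mux2", 0.8e-15, active=False),
])
kept, cap = extract_loads(net)
print([s.name for s in kept], f"lumped load = {cap:.2e} F")
```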

    A Resistive Random Access Memory Addon for the NCSU FreePDK 45 nm

    Circuit Designs of High-Performance and Low-Power RRAM-Based Multiplexers Based on 4T(ransistor)1R(RAM) Programming Structure

    Routing multiplexers based on pass-transistors or transmission gates are essential components in many digital integrated circuits. However, whatever structure is employed, CMOS multiplexers have two major limitations: 1) their delay grows linearly with input size; 2) their performance degrades seriously when operated in the near-Vt regime. Resistive Random Access Memory (RRAM) technology brings opportunities to overcome these limitations by exploiting the properties of RRAMs and their associated programming structures. In this paper, we propose new one-level, two-level, and tree-like multiplexer circuit designs using 4T(ransistor)1R(RAM) elements, and we compare them to naive one-level multiplexers. We consider the main physical design aspects associated with 4T1R-based multiplexers, such as the layout implications in a 7 nm FinFET technology and the co-integration of a low-voltage nominal power supply with a high-voltage programming supply. Electrical simulations show that, using a 7 nm FinFET transistor technology, the proposed 4T1R-based multiplexers reduce delay by 2x and energy by 2.8x over naive 4T1R and 2T1R counterparts. At nominal working voltage, considering input sizes ranging from 2 to 50, the proposed 4T1R-based multiplexers reduce Area-Delay and Power-Delay products by 2.6x and 3.8x, respectively, compared to the best CMOS multiplexers. In the near-Vt regime, the proposed 4T1R-based multiplexers demonstrate 2x better delay efficiency than the best CMOS multiplexers. The proposed 4T1R-based multiplexers operating in the near-Vt regime can still achieve up to 22% delay improvement when compared to the best CMOS multiplexers working at nominal voltage.
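    As a back-of-envelope illustration of why one-level multiplexer delay grows linearly with input size, the sketch below counts switches, critical-path depth, and output-node loading for the standard one-level, two-level, and tree topologies (textbook structure arithmetic, not circuits from the paper):

```python
import math

# Rough structure counts for an N-input routing mux (standard
# topologies, not the paper's circuits). The one-level mux has only
# one switch on the path, but all N switches hang on the shared
# output node, which is why its delay grows ~linearly with N.

def one_level(n):
    return {"switches": n, "path_depth": 1, "output_loading": n}

def two_level(n):
    g = math.ceil(math.sqrt(n))      # number of first-level groups
    k = math.ceil(n / g)             # inputs per first-level mux
    return {"switches": g * k + g, "path_depth": 2, "output_loading": g}

def tree(n):
    depth = math.ceil(math.log2(n))  # levels of 2:1 muxes
    return {"switches": 2 * n - 2, "path_depth": depth, "output_loading": 2}

for n in (2, 16, 50):
    print(f"N={n}: one-level {one_level(n)}, "
          f"two-level {two_level(n)}, tree {tree(n)}")
```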

    Post-P&R Performance and Power Analysis for RRAM-based FPGAs

    Resistive Random Access Memory (RRAM)-based FPGAs are predicted to outperform conventional FPGA architectures in area, delay, and power over a wide range of operating voltages, allowing novel energy-quality trade-offs for reconfigurable computing. The opportunity lies in the fact that an RRAM can realize the functionality of a Static Random Access Memory (SRAM) cell and a transmission gate in a single device. However, most predictive analyses in the state of the art rely on analytical models. Unfortunately, while analytical models have been intensively refined for conventional FPGA architectures, their accuracy for RRAM-based FPGAs has not been carefully investigated; inaccurate analytical models may therefore lead to misleading conclusions. In this paper, we rely on electrical simulations and semi-custom design tools to perform a detailed area and power comparison between SRAM-based and RRAM-based FPGAs. To enable accurate analysis, we develop a synthesizable Verilog generator for both SRAM-based and RRAM-based FPGAs and also enhance FPGA-SPICE to support the most recent RRAM-based circuits and FPGA architectures. The area analyses are based on full-chip layouts of SRAM-based and RRAM-based FPGAs produced by a semi-custom design flow. We consider a full FPGA fabric, including core logic, configuration peripherals, and I/Os, which is more realistic than analytical models. The power analyses are based on SPICE simulation results for the twenty largest MCNC benchmarks. Simulation results identify that the target high-resistance-state resistance (RHRS) of the RRAMs should be at least 20 MΩ to guarantee energy improvements over SRAM-based FPGAs. Experimental results show that, at nominal working voltage, RRAM-based FPGAs improve area by up to 8%, delay by 22% on average, and power by 16% on average, compared to SRAM-based counterparts. Compared to SRAM-based FPGAs working at nominal voltage, near-Vt RRAM-based FPGAs achieve close to a 2x improvement in Energy-Delay Product without delay overhead. As a result, RRAM-based FPGAs are more capable of trading off energy and quality than their SRAM-based counterparts.
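    The 20 MΩ floor is easy to sanity-check with first-order arithmetic: an RRAM held in its high-resistance state leaks statically, so its per-device leakage power is roughly V²/RHRS. A hedged back-of-envelope sketch with illustrative numbers (not the paper's data):

```python
# First-order static-leakage estimate for an RRAM held in its
# high-resistance state (HRS). Illustrative numbers, not the paper's.

V_NOMINAL = 0.7          # supply voltage in volts (assumed)

def hrs_leakage_nw(r_hrs_ohm: float, v: float = V_NOMINAL) -> float:
    """Static power V^2/R through one HRS RRAM, in nanowatts."""
    return (v * v) / r_hrs_ohm * 1e9

for r in (1e6, 20e6, 100e6):   # 1 MΩ, 20 MΩ, 100 MΩ
    print(f"R_HRS = {r/1e6:>5.0f} MΩ -> {hrs_leakage_nw(r):7.3f} nW per device")

# With a large number of RRAMs in a full fabric, a low R_HRS quickly
# erodes any dynamic-energy advantage, which is why a floor around
# 20 MΩ emerges in the simulations above.
```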

    Physical Design Considerations of One-level RRAM-based Routing Multiplexers

    Resistive Random Access Memory (RRAM) technology opens the opportunity to grant both high-performance and low-power features to routing multiplexers. In this paper, we study the physical design considerations related to RRAM-based routing multiplexers and particularly the integration of 4T(ransistor)1R(RAM) programming structures within their routing tree. We first analyze the limitations in the physical design of a naive one-level 4T1R-based multiplexer, such as the co-integration of a low-voltage nominal power supply with a high-voltage programming supply, as well as the use of long metal wires across different isolating wells. To address these limitations, we improve the one-level 4T1R-based multiplexer by re-arranging the nominal and programming voltage domains, and we also study the optimal location of the RRAMs in terms of performance. The improved design can effectively reduce the length of the long metal wires by 50%. Electrical simulations show that, using a 7 nm FinFET transistor technology, the improved 4T1R-based multiplexers improve delay by 69% compared to the basic design. At nominal working voltage, considering input sizes ranging from 2 to 32, the improved 4T1R-based multiplexers outperform the best CMOS multiplexers in area by 1.4x, delay by 2x, and power by 2x, respectively. The improved 4T1R-based multiplexers operating in the near-Vt regime can improve the Power-Delay Product by up to 5.8x when compared to the best CMOS multiplexers working at nominal voltage.
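    The large delay gain from halving the long wires follows from first-order wire scaling: under the Elmore model, the delay of a distributed RC wire grows quadratically with its length. A minimal sketch of that arithmetic, with generic placeholder RC values rather than the paper's extracted parasitics:

```python
# Elmore-style delay of a distributed RC wire: t ~ 0.5 * r * c * L^2,
# where r and c are resistance/capacitance per unit length. Values
# below are generic placeholders, not extracted from the paper.

R_PER_UM = 2.0        # ohms per micron (assumed)
C_PER_UM = 0.2e-15    # farads per micron (assumed)

def wire_delay_ps(length_um: float) -> float:
    """Distributed RC (Elmore) wire delay in picoseconds."""
    return 0.5 * R_PER_UM * C_PER_UM * length_um ** 2 * 1e12

full = wire_delay_ps(100.0)   # original long wire
half = wire_delay_ps(50.0)    # wire shortened by 50%
print(f"full wire: {full:.2f} ps, half wire: {half:.2f} ps "
      f"({full/half:.0f}x less wire delay)")

# The wire component drops 4x; the overall mux delay improves less
# (69% in the paper) because device delay does not scale with wire length.
```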

    A product engine for energy-efficient execution of binary neural networks using resistive memories

    The need for running complex Machine Learning (ML) algorithms, such as Convolutional Neural Networks (CNNs), in edge devices, which are highly constrained in terms of computing power and energy, makes it important to execute such applications efficiently. The situation has led to the popularization of Binary Neural Networks (BNNs), which significantly reduce execution time and memory requirements by representing the weights (and possibly the data being operated on) using only one bit. Because approximately 90% of the operations executed by CNNs and BNNs are convolutions, a significant part of the memory transfers consists of fetching the convolutional kernels. Such kernels are usually small (e.g., 3×3 operands), and particularly in BNNs redundancy is expected. Therefore, equal kernels can be mapped to the same memory addresses, requiring significantly less memory to store them. In this context, this paper presents a custom Binary Dot Product Engine (BDPE) for BNNs that exploits the features of Resistive Random-Access Memories (RRAMs). This new engine accelerates the execution of the inference phase of BNNs. The novel BDPE locally stores the most used binary weights and performs binary convolution using computing capabilities enabled by the RRAMs. The system-level gem5 architectural simulator was used together with a C-based ML framework to evaluate the system's performance and obtain power results. Results show that this novel BDPE improves performance by 11.3% and energy efficiency by 7.4%, and reduces the number of memory accesses by 10.7%, at a cost of less than 0.3% additional die area, when integrated with a 28 nm Fully Depleted Silicon On Insulator ARMv8 in-order core, in comparison to a fully-optimized baseline of YoloV3 XNOR-Net running on an unmodified Central Processing Unit.
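    The binary convolutions the BDPE accelerates reduce a dot product of ±1 vectors to an XNOR followed by a popcount, and equal 3×3 kernels can share one storage slot. A minimal software sketch of both ideas (standard XNOR-Net arithmetic, not the engine's RTL):

```python
# Binary dot product for {-1,+1} vectors packed as bits (1 -> +1,
# 0 -> -1): dot = n - 2 * popcount(a XOR b), i.e. an XNOR-popcount.
# Standard XNOR-Net arithmetic; this is a software model, not the
# BDPE hardware. Requires Python 3.10+ for int.bit_count().

def bin_dot(a: int, b: int, n: int) -> int:
    """Dot product of two n-bit sign vectors packed into ints."""
    return n - 2 * ((a ^ b) & ((1 << n) - 1)).bit_count()

# 3x3 binary kernels are 9-bit patterns, so at most 512 distinct
# kernels exist; equal kernels can share one storage slot.
kernels = [0b101010101, 0b111000111, 0b101010101]   # note the repeat
unique = {}
for k in kernels:
    unique.setdefault(k, len(unique))               # kernel -> slot id
print("slots used:", len(unique), "of", len(kernels), "kernels")

activation = 0b110110011                            # 9-bit input patch
for k, slot in unique.items():
    print(f"kernel slot {slot}: dot = {bin_dot(activation, k, 9)}")
```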

    Resistive random access memory based multiplexers and field programmable gate arrays

    Resistive random access memory (RRAM) based multiplexers and field programmable gate arrays (FPGAs) are provided. The RRAM-based multiplexers and FPGAs include a 4T1R programming structure to program the RRAMs. The programming structure includes two programming transistors connected between the power supply and the top electrode of the RRAM and two programming transistors connected between the power supply and the bottom electrode of the RRAM. The programming transistors are used to set and reset the RRAMs. In the RRAM-based multiplexer, the programming transistors connected to the bottom electrodes are shared among a plurality of RRAMs. The shared programming transistors and an output inverter of the RRAM are provided in a deep N-well of the RRAM-based multiplexer. The programming transistors connected to the top electrodes of the RRAMs and a plurality of input inverters are provided in a regular well of the RRAM-based multiplexer.
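    The programming scheme described above steers the write current by enabling one transistor on each electrode at a time: top pull-up with bottom pull-down to set, and the reverse to reset. A minimal behavioral model of that control logic (a sketch of the described scheme, not circuitry from the patent):

```python
# Behavioral model of 4T1R programming: each RRAM electrode has a
# pull-up and a pull-down programming transistor. Enabling the top
# pull-up and bottom pull-down drives current top->bottom (SET);
# the opposite pair reverses the current (RESET). A sketch of the
# described scheme, not circuitry from the patent.

def program_4t1r(op: str) -> dict:
    """Return which of the four programming transistors conduct."""
    if op == "set":      # top electrode high, bottom electrode low
        return {"top_up": True,  "top_down": False, "bot_up": False, "bot_down": True}
    if op == "reset":    # bottom electrode high, top electrode low
        return {"top_up": False, "top_down": True,  "bot_up": True,  "bot_down": False}
    if op == "idle":     # all programming transistors off
        return {"top_up": False, "top_down": False, "bot_up": False, "bot_down": False}
    raise ValueError(op)

for op in ("set", "reset", "idle"):
    print(op, program_4t1r(op))
```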
